Collaborative Filtering with Python (Python 2.7)

Originally from Salem Marafi

<img src="pic/collaborativeFiltering.jpg" width="700" height="900">

The Last.FM dataset

The data set contains information about users, including their gender, age, and which artists they have listened to on Last.FM. In our case we use only the data for Germany, transformed into a frequency matrix with one row per user and one column per artist.

We will use this to build two types of collaborative filtering:

  • Item Based: finds similar items by comparing the items’ consumption histories

  • User Based: scores items for each user by combining the user’s consumption history with the item similarities


In [5]:
import pandas as pd
from scipy.spatial.distance import cosine

# The data has already been downloaded.
data = pd.read_csv('data/lastfm/lastfm-matrix-germany.csv')

# Preview the first six rows and a few of the artist columns
data.head(6).iloc[:, 2:10]


Out[5]:
   abba  ac/dc  adam green  aerosmith  afi  air  alanis morissette  alexisonfire
0     0      0           0          0    0    0                  0             0
1     0      0           1          0    0    0                  0             0
2     0      0           0          0    0    0                  0             0
3     0      0           0          0    0    0                  0             0
4     0      0           0          0    0    0                  0             0
5     0      0           0          0    0    0                  0             1

Item Based Collaborative Filtering


In [6]:
# In item based collaborative filtering we do not care about the user column.
data_germany = data.drop('user', axis=1)

In [7]:
#Create a placeholder dataframe listing item vs. item
data_ibs = pd.DataFrame(index=data_germany.columns, columns=data_germany.columns)

Now we can start filling in similarities. We will use cosine similarity. In Python, the SciPy library provides a cosine function that we can use without customization; it returns the cosine distance, so the similarity is 1 minus its result.

In essence, the cosine similarity takes the sum product of the first and second columns, then divides that by the product of the square roots of the sums of squares of each column.
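As an illustration of the calculation described above, here is a minimal pure-Python sketch. The function name cosine_similarity and the two toy columns are made up for this example and are not part of the notebook; returning 0 for an all-zero column is an assumption, not something the original code does.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric sequences."""
    dot = sum(x * y for x, y in zip(a, b))       # sum product of the two columns
    norm_a = math.sqrt(sum(x * x for x in a))    # square root of the sum of squares
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an all-zero column is similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical listening columns for two artists across five users
artist_a = [1, 0, 1, 1, 0]
artist_b = [1, 0, 0, 1, 0]
similarity = cosine_similarity(artist_a, artist_b)  # 2 / (sqrt(3) * sqrt(2))
```

Here two users listened to both artists, so the sum product is 2, and the column norms are the square roots of 3 and 2.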


In [8]:
# Fill in the empty cells with cosine similarities
# Loop through the columns
for i in range(len(data_ibs.columns)):
    # Loop through the columns again for each column
    for j in range(len(data_ibs.columns)):
        # scipy's cosine() returns a distance, so similarity is 1 - distance
        data_ibs.iloc[i, j] = 1 - cosine(data_germany.iloc[:, i], data_germany.iloc[:, j])
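The double loop computes every pair twice, even though cosine similarity is symmetric. The sketch below shows one possible way to compute each pair only once, using plain dictionaries and toy data. The names columns, sims and cosine_similarity are hypothetical, and the diagonal is set to 1.0 on the assumption that no column is all zeros.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

# Hypothetical toy data: one list of 0/1 listens per artist
columns = {
    "abba":    [1, 0, 1, 0],
    "madonna": [1, 0, 1, 1],
    "tool":    [0, 1, 0, 0],
}

names = sorted(columns)
sims = {name: {} for name in names}
for i, a in enumerate(names):
    sims[a][a] = 1.0                      # assumption: no all-zero columns
    for b in names[i + 1:]:               # visit each unordered pair once
        value = cosine_similarity(columns[a], columns[b])
        sims[a][b] = value
        sims[b][a] = value                # mirror into the other half
```

This roughly halves the number of similarity computations, which matters because the pairwise loop grows quadratically with the number of artists.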


With our similarity matrix filled out, we can look up each item’s “neighbours” by looping through ‘data_ibs’, sorting each column in descending order, and grabbing the names of the top 10 items. The first neighbour is normally the item itself, since an item is perfectly similar to itself.


In [10]:
# Create a placeholder for the closest neighbours of each item
data_neighbours = pd.DataFrame(index=data_ibs.columns, columns=range(1,11))
 
# Loop through our similarity dataframe and fill in neighbouring item names
for i in range(len(data_ibs.columns)):
    data_neighbours.iloc[i, :10] = data_ibs.iloc[:, i].sort_values(ascending=False)[:10].index
  

Show the results:

In [21]:
data_neighbours.iloc[:10, :5]


Out[21]:
                   1                  2                      3                4                      5
a perfect circle   a perfect circle   tool                   dredg            deftones               porcupine tree
abba               abba               madonna                robbie williams  elvis presley          michael jackson
ac/dc              ac/dc              red hot chili peppers  metallica        iron maiden            the offspring
adam green         adam green         the libertines         the strokes      babyshambles           radiohead
aerosmith          aerosmith          u2                     led zeppelin     metallica              ac/dc
afi                afi                funeral for a friend   rise against     fall out boy           anti-flag
air                air                massive attack         goldfrapp        morcheeba              thievery corporation
alanis morissette  alanis morissette  tori amos              alicia keys      red hot chili peppers  kelly clarkson
alexisonfire       alexisonfire       atreyu                 underoath        funeral for a friend   silverstein
alicia keys        alicia keys        beyonce                norah jones      maria mena             black eyed peas
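The neighbour lookup can also be sketched without pandas. The example below is only an illustration: top_neighbours and the similarity values are hypothetical and not taken from the Last.FM data. Breaking ties by name is an added assumption so that the ordering is deterministic.

```python
def top_neighbours(row, n=10):
    """Return the n names with the highest similarity in `row`, best first."""
    # Sort by similarity (descending), then by name to break ties
    ranked = sorted(row, key=lambda name: (-row[name], name))
    return ranked[:n]

# Hypothetical similarity scores of every item to "abba"
abba_row = {"abba": 1.0, "madonna": 0.6, "tool": 0.1, "air": 0.3}

neighbours = top_neighbours(abba_row, n=3)
```

As in the table above, the item itself comes first because its similarity to itself is 1.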

User Based Collaborative Filtering

The process for creating a User Based recommendation system:

  • Have an Item Based similarity matrix at your disposal (DONE)

  • Check which items the user has consumed (listened to or purchased): items already consumed are not recommended.

  • Otherwise,

    • Find the top N neighbours of the current song
    • Get the user's consumption record (number of listens) for each neighbour.
    • Average the consumption records, using the similarity scores as weights.
  • Recommend the songs with the highest scores (i.e., the weighted average of consumption)
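The steps above can be sketched end to end on toy data. This is a simplified illustration rather than the notebook's implementation: recommend, its parameters and the similarity values are all hypothetical, and consumed items are skipped outright instead of being stored with a score of 0.

```python
def recommend(user_history, similarities, n_neighbours=2, n_recommend=2):
    """Rank items the user has not consumed by a similarity-weighted score."""
    scores = {}
    for item, row in similarities.items():
        if user_history.get(item, 0) == 1:
            continue  # already consumed: do not recommend again
        # Nearest neighbours of this item, excluding the item itself
        neighbours = sorted((n for n in row if n != item),
                            key=lambda n: (-row[n], n))[:n_neighbours]
        weights = [row[n] for n in neighbours]
        listens = [user_history.get(n, 0) for n in neighbours]
        total = sum(weights)
        # Weighted average of the user's listens; 0 if there is nothing to weigh
        scores[item] = (sum(l * w for l, w in zip(listens, weights)) / total
                        if total else 0.0)
    return sorted(scores, key=lambda item: (-scores[item], item))[:n_recommend]

# Hypothetical item-item similarities and one user's listening history
similarities = {
    "abba":    {"abba": 1.0, "madonna": 0.8, "tool": 0.1, "dredg": 0.0},
    "madonna": {"abba": 0.8, "madonna": 1.0, "tool": 0.2, "dredg": 0.1},
    "tool":    {"abba": 0.1, "madonna": 0.2, "tool": 1.0, "dredg": 0.7},
    "dredg":   {"abba": 0.0, "madonna": 0.1, "tool": 0.7, "dredg": 1.0},
}
user_history = {"abba": 1, "madonna": 0, "tool": 0, "dredg": 0}

recommendations = recommend(user_history, similarities)
```

Because this user listened only to abba, madonna scores highest: abba is its closest neighbour and carries most of the weight.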

We first need a formula. We calculate the inner product of two vectors (one containing the user's purchase history for the neighbouring items, the other containing those items' similarity scores to the current song), then divide the result by the sum of the similarities.


In [60]:
# Helper function: similarity-weighted average of a user's consumption history
def getScore(history, similarities):
    return sum(history * similarities) / sum(similarities)
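To make the arithmetic concrete, here is a small worked example on plain lists. The name get_score and the numbers are hypothetical. The guard against a zero sum of similarities is an assumption added for illustration; the helper above has no such check.

```python
def get_score(history, similarities):
    """Similarity-weighted average of a user's consumption history."""
    total = float(sum(similarities))
    if total == 0:
        return 0.0  # assumption: no similar neighbours means a score of 0
    return sum(h * s for h, s in zip(history, similarities)) / total

# The user listened to the first and third neighbour, but not the second
history = [1, 0, 1]
sims = [0.5, 0.3, 0.2]

score = get_score(history, sims)  # (0.5 + 0.2) / (0.5 + 0.3 + 0.2)
```

The score approaches 1 when the user has consumed the most similar neighbours and 0 when they have consumed none of them.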

The rest is a matter of applying this function to the data frames in the right way. We start by creating a variable to hold our similarity data. This is basically the same as our original data but with nothing filled in except the headers.


In [61]:
# Create a placeholder matrix for similarities, and fill in the user name column
data_sims = pd.DataFrame(index=data.index, columns=data.columns)
data_sims.iloc[:, :1] = data.iloc[:, :1]

In [63]:
data_sims.head(3).iloc[:, :10]


Out[63]:
   user  a perfect circle  abba  ac/dc  adam green  aerosmith  afi  air  alanis morissette  alexisonfire
0     1               NaN   NaN    NaN         NaN        NaN  NaN  NaN                NaN           NaN
1    33               NaN   NaN    NaN         NaN        NaN  NaN  NaN                NaN           NaN
2    42               NaN   NaN    NaN         NaN        NaN  NaN  NaN                NaN           NaN

We now loop through the rows and columns filling in empty spaces with similarity scores.

Note that we score items that the user has already consumed as 0, because there is no point in recommending them again.


In [64]:
# Loop through all rows, skip the user column, and fill with similarity scores
for i in range(len(data_sims.index)):
    for j in range(1, len(data_sims.columns)):
        user = data_sims.index[i]
        product = data_sims.columns[j]

        if data.iloc[i, j] == 1:
            # Already consumed: score 0 so it is not recommended again
            data_sims.iloc[i, j] = 0
        else:
            # Positions 1-9 skip the item itself, which normally ranks first
            product_top_names = data_neighbours.loc[product].iloc[1:10]
            product_top_sims = data_ibs.loc[product].sort_values(ascending=False).iloc[1:10]
            user_purchases = data_germany.loc[user, product_top_names]
            data_sims.iloc[i, j] = getScore(user_purchases, product_top_sims)

We can now produce a matrix of User Based recommendations as follows:


In [68]:
# Placeholder for the top songs per user, with the user column filled in
data_recommend = pd.DataFrame(index=data_sims.index, columns=['user','1','2','3','4','5','6'])
data_recommend.iloc[:, 0] = data_sims.iloc[:, 0]

Instead of having the matrix filled with similarity scores, however, it would be nice to see the song names. This can be done with the following loop:


In [69]:
# Instead of top song scores, we want to see names
for i in range(len(data_sims.index)):
    # Skip the user column, sort the scores in descending order, keep the top six names
    data_recommend.iloc[i, 1:] = data_sims.iloc[i, 1:].sort_values(ascending=False).index[:6]

In [70]:
# Print a sample: the first five users and their top four recommendations
print(data_recommend.iloc[:5, :5])


  user                      1              2                3              4
0    1         flogging molly       coldplay        aerosmith    the beatles
1   33  red hot chili peppers  kings of leon        peter fox      gentleman
2   42                 oomph!    lacuna coil        rammstein     schandmaul
3   51            the subways      the kooks  franz ferdinand      the hives
4   62           jack johnson        incubus       mando diao  the fratellis
